This article studies exponential families $\mathcal{E}$ on finite sets such that the
information divergence $D(P\|\mathcal{E})$ of an arbitrary probability distribution from
$\mathcal{E}$ is bounded by some constant $D > 0$. A particular class of low-dimensional
exponential families that have low values of $D$ can be obtained from partitions of the
state space. The main results concern optimality properties of these partition
exponential families. Exponential families where $D = \log(2)$ are studied in detail.
This case is special, because if $D < \log(2)$, then $\mathcal{E}$ contains all
probability measures with full support.

1 Introduction



Let $\mathcal{X}$ be a finite set of cardinality $N$, and denote by
$\mathcal{P}(\mathcal{X})$ the set of probability distributions on $\mathcal{X}$. The
information divergence $D(P\|Q)$ is a natural distance measure on
$\mathcal{P}(\mathcal{X})$. For any exponential family $\mathcal{E}$ on $\mathcal{X}$ (as
defined in Section 2) and any $P \in \mathcal{P}(\mathcal{X})$ write
$D_{\mathcal{E}}(P) = \inf_{Q \in \mathcal{E}} D(P\|Q)$. This article discusses the
following question:

• Let $D > 0$, and choose a partial order on the exponential families. Which
exponential families are minimal among all exponential families $\mathcal{E}$ satisfying
$\max D_{\mathcal{E}} \le D$? What is the answer to this question under further
constraints on $\mathcal{E}$?



This question is related to finding the maximizers of the information divergence from 
an exponential family, a problem which was first formulated by Nihat Ay in [1]. See [H] 
for an overview and further references. The present work builds on recent progress in [15] 
and [ig. 

There are at least two partial orders of interest: 

(i) The partial order induced by the dimensions of the exponential families.
(ii) The partial order by inclusion.

The partial order (i) is particularly important for applications, since the dimension
of an exponential family is one of the most important invariants that determine the
complexity of all computations. The partial order (ii) can be seen as a "local
relaxation": A candidate exponential family $\mathcal{E}$ is only compared to "similar"
exponential families, contained in $\mathcal{E}$.

Definition 1. Let $\mathcal{H}$ be a set of exponential families. An exponential family
$\mathcal{E} \in \mathcal{H}$ is called inclusion $D$-optimal among $\mathcal{H}$ for
some $D \ge \max D_{\mathcal{E}}$ if every $\mathcal{E}' \in \mathcal{H}$ strictly
contained in $\mathcal{E}$ satisfies
$\max D_{\mathcal{E}} \le D < \max D_{\mathcal{E}'}$. An exponential family
$\mathcal{E} \in \mathcal{H}$ is called dimension $D$-optimal among $\mathcal{H}$ if every
exponential family $\mathcal{E}' \in \mathcal{H}$ of smaller dimension satisfies
$\max D_{\mathcal{E}} \le D < \max D_{\mathcal{E}'}$. Exponential families that are
inclusion or dimension $D$-optimal among $\mathcal{H}$ for some $D$ are also called
inclusion or dimension optimal among $\mathcal{H}$, without reference to $D$. If
$\mathcal{H}$ equals the set of all exponential families, then the reference to
$\mathcal{H}$ may be omitted in all definitions. Let

$$D_{N,k}(\mathcal{H}) = \min\{\max D_{\mathcal{E}} : \mathcal{E} \in \mathcal{H} \text{ is an exponential family of dimension } k \text{ on } [N]\}.$$

As an example, the set $\mathcal{H}$ may be the set of hierarchical models, the set of
graphical models or the set $\mathcal{H}_1$ of exponential families containing the
uniform distribution. Obviously, any dimension optimal model is also inclusion optimal.
The converse statement does not hold, see Example 27 below.

A $D$-optimal exponential family $\mathcal{E}$ can approximate arbitrary probability
measures well, up to a maximal divergence of $D$. Yaroslav Bulatov proposed to use such
exponential families in machine learning (personal communication), for example when
using the minimax algorithm [17] by Zhu, Wu and Mumford or the feature induction
algorithm [5] by Della Pietra, Della Pietra and Lafferty. Both algorithms inductively
construct an exponential family by adding functions ("features") to the tangent space
in order to approximate a given distribution. Applications of the results of the
present paper to machine learning are not discussed here and are left to future work.

One motivation to restrict the class $\mathcal{H}$ of exponential families is that the
learning system may not be able to represent arbitrary exponential families. Another
motivation is given by Jaynes' principle of maximum entropy [8], which suggests using
the class $\mathcal{H}_1$ of exponential families with uniform reference measure.

This paper also introduces the class of partition models (see Section 3): A probability
measure $P$ belongs to the partition model associated to a partition
$\mathcal{X}' = \{\mathcal{X}^1, \dots, \mathcal{X}^{N'}\}$ if the restriction of $P$ to
each block $\mathcal{X}^i$ is uniform. Conjecture 29 relates partition models to the
above question:

Conjecture 29. $D_{N,k} = \log\lceil N/(k+1) \rceil$, and the dimension
$D_{N,k}$-optimal exponential families containing the uniform distribution are
partition models.

The results in Section 5 show that the conjecture is true if
$\lceil N/(k+1) \rceil \le 2$, and Theorem 28 proves the conjecture if $k + 1$ divides
$N$.

This paper is organized as follows: Section 2 collects the necessary preliminaries about
exponential families and the information divergence. Section 3 introduces partition
models and studies their basic properties. $\log(2)$-optimal exponential families
$\mathcal{E}$ are studied in Section 4. Section 5 presents results on $D$-optimal
exponential families for arbitrary $D$.

2 Preliminaries 

This section collects known facts that are needed in later sections. It starts with some
notions from matroid theory before defining exponential families, the information
divergence and hierarchical models. The last part discusses the function
$\overline{D}_{\mathcal{E}}$, which arises naturally when studying the maximizers of
$D_{\mathcal{E}}$.

2.1 Circuits 

This section recalls some elementary notions from the theory of matroids. Only repre- 
sentable matroids will play a role, but nevertheless the language of abstract matroids is 
useful. See [13] for an introduction. 

Definition 2. Let $\mathcal{N}$ be a linear subspace of $\mathbb{R}^{\mathcal{X}}$. The
support of $u \in \mathcal{N}$ is defined as
$\operatorname{supp}(u) := \{x \in \mathcal{X} : u(x) \ne 0\}$. A vector
$u \in \mathcal{N} \setminus \{0\}$ is called a circuit vector if and only if for any
$v \in \mathcal{N}$ satisfying $\operatorname{supp}(v) \subseteq \operatorname{supp}(u)$
there exists $\alpha \in \mathbb{R}$ such that $v = \alpha u$. In other words, circuit
vectors are vectors with minimal support. The support $\operatorname{supp}(u)$ of a
circuit vector $u$ is called a circuit. A finite set $C \subset \mathcal{N}$ is a
circuit basis if and only if the map $u \in C \mapsto \operatorname{supp}(u)$ is
injective and maps onto the set of circuits.

Lemma 3. For every nonzero vector $u \in \mathcal{N}$ and any $x \in \mathcal{X}$ such
that $u(x) \ne 0$ there exists a circuit vector $c \in \mathcal{N}$ such that
$\operatorname{supp}(c) \subseteq \operatorname{supp}(u)$ and $c(x) \ne 0$.

Proof. Let $c$ be a vector with inclusion-minimal support that satisfies
$\operatorname{supp}(c) \subseteq \operatorname{supp}(u)$ and $c(x) \ne 0$. If $c$ is not
a circuit vector, then there exists a circuit vector $c'$ with
$\operatorname{supp}(c') \subset \operatorname{supp}(c)$. A suitable linear combination
$c + \alpha c'$, $\alpha \in \mathbb{R}$, gives a contradiction to the minimality of
$c$. □

It follows that any circuit basis of $\mathcal{N}$ contains a spanning set.
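
The notions of Definition 2 can be explored numerically on small examples. The following
Python sketch is an illustration only and not part of the original text: it assumes
that the subspace is given as the kernel of a matrix `A` (a hypothetical input), and it
uses the fact that the circuits of the kernel are the inclusion-minimal sets of linearly
dependent columns of `A`. The function names and the tolerance are assumptions.

```python
import itertools
import numpy as np

def null_space(M, tol=1e-10):
    """Rows form an orthonormal basis of the kernel of M (computed via SVD)."""
    M = np.atleast_2d(np.asarray(M, dtype=float))
    _, s, vt = np.linalg.svd(M)
    rank = int((s > tol).sum())
    return vt[rank:]

def circuits(A, tol=1e-10):
    """Supports of the minimal-support nonzero vectors in ker(A).

    A set S of column indices is a circuit if the columns indexed by S are
    linearly dependent and no proper subset of S has this property.
    """
    A = np.atleast_2d(np.asarray(A, dtype=float))
    n = A.shape[1]
    found = []
    for size in range(1, n + 1):
        for S in itertools.combinations(range(n), size):
            if any(set(C) <= set(S) for C in found):
                continue  # S contains a smaller circuit, so it is not minimal
            if len(null_space(A[:, list(S)], tol)) > 0:
                found.append(S)
    return found

if __name__ == "__main__":
    # Kernel of a generic 2 x 4 matrix: every 3-element subset is a circuit.
    print(circuits([[1, 1, 1, 1], [0, 1, 2, 3]]))
    # Block indicator rows: the circuits are the two blocks.
    print(circuits([[1, 1, 0, 0], [0, 0, 1, 1]]))
```

The enumeration is exponential in the number of columns, so the sketch is only meant
for very small state spaces.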

2.2 Exponential families and the information divergence 

In this work only exponential families on a finite set $\mathcal{X}$ are studied, for
the information divergence from a finite-dimensional exponential family on an infinite
set is usually unbounded, cf. Theorem 28. See [2] and [3] for an introduction to
exponential families and the information divergence.

Let $\mathcal{T}$ be a linear subspace of $\mathbb{R}^{\mathcal{X}}$ containing the
constant function, and let $\nu$ be a strictly positive measure on $\mathcal{X}$. The
set $\mathcal{E} = \mathcal{E}_{\nu,\mathcal{T}}$ of all probability measures on
$\mathcal{X}$ of the form

$$P_\vartheta(x) = \frac{\nu_x}{Z_\vartheta} e^{\vartheta(x)}, \qquad \vartheta \in \mathcal{T}, \tag{1}$$

is called an exponential family. $\nu$ is a reference measure, and $\mathcal{T}$ will be
called the extended tangent space of $\mathcal{E}$. The extended tangent space carries
its name since its image modulo the constant functions is isomorphic to the tangent
space of the manifold $\mathcal{E}$ at any point. The orthogonal complement
$\mathcal{N} := \mathcal{T}^\perp$ will be called the normal space of $\mathcal{E}$. The
normal space is orthogonal to the tangent space of $\mathcal{E}$ at any point
$P \in \mathcal{E}$ with respect to the Fisher metric at $P$. The topological closure of
$\mathcal{E}$ will be denoted by $\overline{\mathcal{E}}$.

The exponential family $\mathcal{E}_{\nu,\mathcal{T}}$ can be parametrized as follows:
If $a_1, \dots, a_h \in \mathbb{R}^{\mathcal{X}}$ form a spanning set of $\mathcal{T}$,
then $\mathcal{E}$ consists of all probability distributions of the form

$$P_\theta(x) = \frac{\nu_x}{Z_\theta} \exp\Bigl(\sum_{i=1}^{h} \theta_i a_i(x)\Bigr). \tag{2}$$

In this formula $\theta \in \mathbb{R}^h$ is a vector of parameters and $Z_\theta$
ensures normalization. The matrix $A = (a_i(x))_{i,x} \in \mathbb{R}^{h \times \mathcal{X}}$
is called a sufficient statistics of $\mathcal{E}$. The linear map corresponding to $A$
is called the moment map, denoted by $\pi_A$. The columns of $A$ will be denoted by
$A_x$, $x \in \mathcal{X}$. The normal space of $\mathcal{E}$ equals
$\mathcal{N} = \{u \in \ker A : \sum_x u(x) = 0\}$. The convex hull of
$\{A_x : x \in \mathcal{X}\}$ is a polytope called the convex support of $\mathcal{E}$.
This polytope is independent of the choice of $A$ up to an affine transformation.
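
As an illustration of the parametrization by a sufficient statistics, the following
Python sketch evaluates a member of an exponential family and computes a basis of its
normal space. It is a minimal sketch, not part of the original text: the function
names are hypothetical, NumPy is an assumed dependency, and the parametrization is
assumed to be $P_\theta(x) \propto \nu_x \exp(\sum_i \theta_i a_i(x))$ as described
above.

```python
import numpy as np

def exp_family_point(A, nu, theta):
    """Return P_theta with P_theta(x) proportional to nu_x * exp(sum_i theta_i a_i(x)).

    A     : sufficient statistics, shape (h, |X|)
    nu    : strictly positive reference measure, length |X|
    theta : parameter vector, length h
    """
    A = np.atleast_2d(np.asarray(A, dtype=float))
    nu = np.asarray(nu, dtype=float)
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    log_w = np.log(nu) + theta @ A
    log_w -= log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()            # division by Z_theta

def normal_space_basis(A, tol=1e-12):
    """Orthonormal basis (as rows) of {u in ker A : sum_x u(x) = 0}."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    M = np.vstack([A, np.ones(A.shape[1])])   # add the constant function
    _, s, vt = np.linalg.svd(M)
    rank = int((s > tol).sum())
    return vt[rank:]

if __name__ == "__main__":
    A = [[0, 1, 2]]
    nu = [1, 4, 1]
    print(exp_family_point(A, nu, [0.0]))   # the normalized reference measure
    print(normal_space_basis(A))            # proportional to (1, -2, 1)
```

Every vector returned by `normal_space_basis` is orthogonal to the rows of `A` and to
the constant function, which is the defining property of the normal space.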

Any function $u \in \mathbb{R}^{\mathcal{X}}$ can be decomposed uniquely as a
difference $u = u^+ - u^-$ of non-negative functions such that
$\operatorname{supp}(u^+) \cap \operatorname{supp}(u^-) = \emptyset$. The following
implicit description of an exponential family is useful in many contexts.

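Theorem 4. Let $\mathcal{E}$ be an exponential family with normal space $\mathcal{N}$
and reference measure $\nu$, and let $C$ be a circuit basis of $\mathcal{N}$. A
probability measure $P$ on $\mathcal{X}$ belongs to $\overline{\mathcal{E}}$ if and only
if $P$ satisfies

$$\prod_{x \in \mathcal{X}} \Bigl(\frac{P(x)}{\nu_x}\Bigr)^{u^+(x)} = \prod_{x \in \mathcal{X}} \Bigl(\frac{P(x)}{\nu_x}\Bigr)^{u^-(x)} \quad \text{for all } u = u^+ - u^- \in C. \tag{3}$$
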

Proof. See [TBI Theorem 10]. □ 
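
Let $\mathcal{E}_1, \dots, \mathcal{E}_c \subseteq \mathcal{P}(\mathcal{X})$. The
mixture of $\mathcal{E}_1, \dots, \mathcal{E}_c$ is the set of probability measures

$$\Bigl\{ P = \sum_{i=1}^{c} \lambda_i P_i : P_1 \in \mathcal{E}_1, \dots, P_c \in \mathcal{E}_c \text{ and } \lambda \in \mathbb{R}^c_{\ge}, \ \sum_{i=1}^{c} \lambda_i = 1 \Bigr\}.$$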

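Corollary 5. Let $\mathcal{E}$ be an exponential family with normal space
$\mathcal{N}$. Let $\mathcal{Y} \subseteq \mathcal{X}$. If every circuit vector
$c \in \mathcal{N}$ satisfies $\operatorname{supp}(c) \subseteq \mathcal{Y}$ or
$\operatorname{supp}(c) \subseteq \mathcal{X} \setminus \mathcal{Y}$, then
$\overline{\mathcal{E}}$ equals the mixture of
$\overline{\mathcal{E}} \cap \mathcal{P}(\mathcal{Y})$ and
$\overline{\mathcal{E}} \cap \mathcal{P}(\mathcal{X} \setminus \mathcal{Y})$.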

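Proof. For any probability measure $P$ and subset $\mathcal{Y} \subseteq \mathcal{X}$
define the truncation $P^{\mathcal{Y}}$ as follows: If $P(\mathcal{Y}) > 0$, then

$$P^{\mathcal{Y}}(x) = \begin{cases} \frac{1}{P(\mathcal{Y})} P(x), & \text{if } x \in \mathcal{Y}, \\ 0, & \text{else;} \end{cases}$$

otherwise let $P^{\mathcal{Y}}$ be an arbitrary probability distribution on
$\mathcal{Y}$. By Theorem 4 a probability measure $P \in \mathcal{P}(\mathcal{X})$ with
full support lies in $\mathcal{E}$ if and only if its truncations $P^{\mathcal{Y}}$ and
$P^{\mathcal{X} \setminus \mathcal{Y}}$ lie in
$\overline{\mathcal{E}} \cap \mathcal{P}(\mathcal{Y})$ and
$\overline{\mathcal{E}} \cap \mathcal{P}(\mathcal{X} \setminus \mathcal{Y})$,
respectively. □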



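The corollary can be reformulated as follows, using terminology from matroid theory:
If $\mathcal{X}_1, \dots, \mathcal{X}_c$ are the connected components of the matroid of
$\mathcal{N}$, then $\overline{\mathcal{E}}$ equals the mixture of
$\overline{\mathcal{E}}_1, \dots, \overline{\mathcal{E}}_c$, where
$\mathcal{E}_i = \overline{\mathcal{E}} \cap \mathcal{P}(\mathcal{X}_i)^\circ$ is an
exponential family on $\mathcal{X}_i$ for $i = 1, \dots, c$.
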

The information divergence (also known as the Kullback-Leibler divergence or relative
entropy) of positive measures $P$, $Q$ is defined as

$$D(P\|Q) = \sum_{x \in \mathcal{X}} P(x) \log\frac{P(x)}{Q(x)}, \tag{4}$$

with the convention that $0 \log 0 = 0 \log(0/0) = 0$. It is finite unless
$\operatorname{supp}(P)$ is not contained in $\operatorname{supp}(Q)$. If $\nu$ equals
the counting measure on $\mathcal{X}$ (i.e. $\nu_x = 1$ for all $x$), then $D(P\|\nu)$
equals minus the Shannon entropy $H(P)$. If $P$ and $Q$ are probability measures, then
$D(P\|Q)$ is strictly positive unless $P = Q$.
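
The conventions in the definition of the information divergence are easy to get wrong
in an implementation. The following Python sketch is an illustration only: it assumes
the standard formula $D(P\|Q) = \sum_x P(x) \log(P(x)/Q(x))$ with natural logarithms,
and the function name is hypothetical.

```python
import math

def divergence(P, Q):
    """Information divergence D(P||Q) of two non-negative vectors.

    Terms with P(x) = 0 contribute 0 (convention 0 log 0 = 0 log(0/0) = 0).
    The value is infinite if supp(P) is not contained in supp(Q).
    """
    total = 0.0
    for p, q in zip(P, Q):
        if p == 0:
            continue
        if q == 0:
            return math.inf
        total += p * math.log(p / q)
    return total

if __name__ == "__main__":
    P = [0.5, 0.5, 0.0]
    counting = [1.0, 1.0, 1.0]
    # With the counting measure as second argument the result is minus the entropy.
    print(divergence(P, counting), -math.log(2))
```

With the counting measure as the second argument the function returns $-H(P)$, in line
with the remark above.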

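Let $\mathcal{E}$ be an exponential family. For any probability measure $P$ on
$\mathcal{X}$ there is a unique probability distribution
$P_{\mathcal{E}} \in \overline{\mathcal{E}}$ such that
$D(P\|P_{\mathcal{E}}) = \inf_{Q \in \mathcal{E}} D(P\|Q)$, see [1]. The measure
$P_{\mathcal{E}}$ is called the (generalized) rI-projection of $P$ to $\mathcal{E}$ or
the (generalized) MLE. It can also be characterized as the unique probability measure
$P_{\mathcal{E}} \in \overline{\mathcal{E}}$ such that
$P - P_{\mathcal{E}} \in \mathcal{N}$. Alternatively, $P_{\mathcal{E}}$ minimizes the
function $Q \mapsto D(Q\|\nu)$ on
$\{Q \in \mathcal{P}(\mathcal{X}) : P - Q \in \mathcal{N}\}$. In particular, if $\nu$ is
the counting measure, then $P_{\mathcal{E}}$ maximizes the entropy.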

2.3 Hierarchical loglinear models 

Let $\mathcal{X}_1, \dots, \mathcal{X}_n$ be finite sets of cardinality
$|\mathcal{X}_i| = N_i$, and let
$\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n$. For any subset
$S \subseteq [n]$ let $\mathcal{X}_S = \times_{i \in S} \mathcal{X}_i$. The restrictions
$X_i : \mathcal{X} \to \mathcal{X}_i$ to the subsystems can be viewed as random
variables, and hierarchical models can be used to study the relationship of these
discrete random variables. This section summarizes the main facts which are needed in
the following. See [TU] and [B] for further information.

Definition 6. For any family $\Delta$ of subsets of $[n]$ let $\mathcal{E}'_\Delta$ be
the set of all probability measures $P \in \mathcal{P}(\mathcal{X})$ that can be written
in the form

$$P(x) = \prod_{S \in \Delta} f_S(x), \tag{5}$$

where each $f_S$ is a non-negative function on $\mathcal{X}$ that depends only on those
components of $x$ lying in $S$. In other words, $f_S(x) = f_S(y)$ for all
$x = (x_i)_{i=1}^n$, $y = (y_i)_{i=1}^n \in \mathcal{X}$ satisfying $x_i = y_i$ for all
$i \in S$. The hierarchical exponential family $\mathcal{E}_\Delta$ of $\Delta$ with
parameters $N_1, N_2, \dots, N_n$ is defined as
$\mathcal{E}'_\Delta \cap \mathcal{P}(\mathcal{X})^\circ$. The closure of
$\mathcal{E}_\Delta$ (which equals the closure of $\mathcal{E}'_\Delta$) is called the
hierarchical model of $\Delta$ with parameters $N_1, N_2, \dots, N_n$.

At first sight one might think that
$\mathcal{E}'_\Delta = \overline{\mathcal{E}_\Delta}$. Unfortunately, this is not true,
see [7]. For certain applications, when the factorizability property is important, one
might want to call $\mathcal{E}'_\Delta$ a hierarchical model. When studying
optimization problems it is more important that the models are closed.

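For any $S \subseteq \{1, \dots, n\}$ the subset of $\mathbb{R}^{\mathcal{X}}$ of
functions that only depend on the $S$-components can be naturally identified with
$\mathbb{R}^{\mathcal{X}_S}$. The projection $\mathcal{X} \to \mathcal{X}_S$ induces a
natural injection $\mathbb{R}^{\mathcal{X}_S} \to \mathbb{R}^{\mathcal{X}}$.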

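It is easy to see that hierarchical exponential families are indeed exponential
families: Namely, (5) implies that $\mathcal{E}_\Delta$ consists of all
$P \in \mathcal{P}(\mathcal{X})^\circ$ that satisfy

$$(\log(P(x)))_{x \in \mathcal{X}} \in \sum_{S \in \Delta} \mathbb{R}^{\mathcal{X}_S} \subseteq \mathbb{R}^{\mathcal{X}}.$$

Therefore, $\mathcal{E}_\Delta$ is an exponential family with uniform reference measure
and extended tangent space $\mathcal{T} = \sum_{S \in \Delta} \mathbb{R}^{\mathcal{X}_S}$.
This vector space sum is not direct, since every summand contains the constant function
$1$. There is a natural sufficient statistics: The marginalization maps
$\pi_S : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}_S}$ defined for
$S \subseteq \{1, \dots, n\}$ via

$$\pi_S(v)(x) = \sum_{y \in \mathcal{X} : y_i = x_i \text{ for all } i \in S} v(y)$$

induce the moment map

$$\pi_\Delta : v \in \mathbb{R}^{\mathcal{X}} \mapsto (\pi_S(v))_{S \in \Delta} \in \bigoplus_{S \in \Delta} \mathbb{R}^{\mathcal{X}_S},$$

where $\oplus$ denotes the (external) direct sum of vector spaces.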

Lemma 7. Let $\Delta$ be a collection of subsets of $[n]$, and let
$K = \bigcup_{J \in \Delta} J$. The marginal polytope of $\Delta$ is (affinely
equivalent to) a 0-1-polytope with $\prod_{i \in K} N_i$ vertices.

Proof. The moment map $\pi_\Delta$ corresponds to a sufficient statistics $A_\Delta$
that only has entries 0 and 1, so the marginal polytope is a 0-1-polytope. The set of
vertices of the marginal polytope is a subset of $\{A_x : x \in \mathcal{X}\}$. Let
$x = (x_i)_{i=1}^n$, $y = (y_i)_{i=1}^n \in \mathcal{X}$. If $x_i = y_i$ for all
$i \in K$, then $A_x = A_y$, so the marginal polytope has at most
$\prod_{i \in K} N_i$ vertices. If $x_i \ne y_i$ for some $i \in K$, then
$A_x \ne A_y$, so the set $\{A_x : x \in \mathcal{X}\}$ has cardinality
$\prod_{i \in K} N_i$. Since this set consists of 0-1-vectors and since no 0-1-vector
is a convex combination of other 0-1-vectors, it follows that the set of vertices of
the marginal polytope equals $\{A_x : x \in \mathcal{X}\}$ and has cardinality
$\prod_{i \in K} N_i$. □

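2.4 The function $\overline{D}_{\mathcal{E}}$

The function $D_{\mathcal{E}}$ is related to the function

$$\overline{D}_{\mathcal{E}}(u) = \sum_{x \in \mathcal{X}} u(x) \log\frac{|u(x)|}{\nu_x}$$

defined on $\mathcal{N}$ [15]. The function satisfies
$\overline{D}_{\mathcal{E}}(\alpha u) = \alpha \overline{D}_{\mathcal{E}}(u)$ for all
$\alpha \in \mathbb{R}$ and $u \in \mathcal{N}$. It will mostly be considered on a
subset $\partial U_{\mathcal{N}}$ of $\mathcal{N}$, defined as follows: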
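Definition 8. For any $v \in \mathbb{R}^{\mathcal{X}}$ and
$\mathcal{Z} \subseteq \mathcal{X}$ write $v(\mathcal{Z}) := \sum_{x \in \mathcal{Z}} v(x)$.
Let

$$\partial U_{\mathcal{N}} := \{u \in \mathcal{N} : u^+(\mathcal{X}) = u^-(\mathcal{X}) = 1\}.$$

The map $\Psi^+ : u \mapsto u^+$ maps $\partial U_{\mathcal{N}}$ to a subset of
$\mathcal{P}(\mathcal{X})$. A probability distribution in the image of $\Psi^+$ is
called a kernel distribution.

In the other direction there is the natural map
$\Psi_{\mathcal{E}} : \mathcal{P}(\mathcal{X}) \setminus \overline{\mathcal{E}} \to \mathcal{N}$,
defined via

$$\Psi_{\mathcal{E}}(P) = \frac{P - P_{\mathcal{E}}}{(P - P_{\mathcal{E}})^+(\mathcal{X})}.$$

The denominator makes sure that the image of $\Psi_{\mathcal{E}}$ lies in
$\partial U_{\mathcal{N}}$. Since $P = P_{\mathcal{E}}$ if and only if
$P \in \overline{\mathcal{E}}$, the map $\Psi_{\mathcal{E}}$ is well-defined on
$\mathcal{P}(\mathcal{X}) \setminus \overline{\mathcal{E}}$.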

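Theorem 9. Let $\mathcal{E}$ be an exponential family with normal space
$\mathcal{N} \ne 0$. The map $\Psi_{\mathcal{E}}$ restricts to a bijection from the set
of local maximizers of $D_{\mathcal{E}}$ to the set of local maximizers of
$\overline{D}_{\mathcal{E}}$. An inverse is given by the restriction of the map
$\Psi^+ : u \mapsto u^+$. If $P \in \mathcal{P}(\mathcal{X})$ and
$u \in \partial U_{\mathcal{N}}$ are local maximizers of $D_{\mathcal{E}}$ and
$\overline{D}_{\mathcal{E}}$, respectively, then

$$D_{\mathcal{E}}(P) = \log(1 + \exp(\overline{D}_{\mathcal{E}}(\Psi_{\mathcal{E}}(P)))) \quad \text{and} \quad D_{\mathcal{E}}(u^+) = \log(1 + \exp(\overline{D}_{\mathcal{E}}(u))).$$

Proof. See [12, Theorem 1]. □

See [12] and [H] for further relations between the functions $D_{\mathcal{E}}$ and
$\overline{D}_{\mathcal{E}}$.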
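
The relation between the two functions in Theorem 9 can be checked numerically on a
small example. The following Python sketch is an illustration under stated
assumptions: it assumes the formula
$\overline{D}_{\mathcal{E}}(u) = \sum_x u(x) \log(|u(x)|/\nu_x)$, and it borrows the
one-dimensional family with sufficient statistics $A = (0, 1, 2)$ and reference
measure $\nu = (1, 4, 1)$ that appears in Example 27. The helper names are
hypothetical.

```python
import math

def divergence(P, Q):
    """D(P||Q) with the convention 0 log 0 = 0; assumes supp(P) within supp(Q)."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def d_bar(u, nu):
    """Assumed formula: sum_x u(x) log(|u(x)| / nu_x), with 0 log 0 = 0."""
    return sum(ux * math.log(abs(ux) / nx) for ux, nx in zip(u, nu) if ux != 0)

def in_normal_space(v, A, tol=1e-12):
    """Check that v is orthogonal to the row A and to the constant function."""
    return abs(sum(v)) < tol and abs(sum(a * x for a, x in zip(A, v))) < tol

A = [0.0, 1.0, 2.0]
nu = [1.0, 4.0, 1.0]
u = [-0.5, 1.0, -0.5]                  # normalized generator of the normal space
u_plus = [max(x, 0.0) for x in u]      # kernel distribution delta_2
u_minus = [max(-x, 0.0) for x in u]    # kernel distribution (delta_1 + delta_3) / 2
P_E = [n / 6.0 for n in nu]            # candidate rI-projection of u_plus and u_minus

if __name__ == "__main__":
    # P_E lies in the family (theta = 0) and differs from u_plus by a normal vector.
    print(in_normal_space([a - b for a, b in zip(u_plus, P_E)], A))
    print(divergence(u_plus, P_E), math.log(1 + math.exp(d_bar(u, nu))))
    minus_u = [-x for x in u]
    print(divergence(u_minus, P_E), math.log(1 + math.exp(d_bar(minus_u, nu))))
```

Both printed pairs agree, which is consistent with the second identity of Theorem 9
applied to $u$ and to $-u$.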

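Corollary 10. Let $\mathcal{E}$ be an exponential family. If
$\overline{\mathcal{E}} \ne \mathcal{P}(\mathcal{X})$, then
$\max D_{\mathcal{E}} \ge \log(2)$.

Proof. Let $u \in \partial U_{\mathcal{N}}$ be a global maximizer of
$\overline{D}_{\mathcal{E}}$. Since
$\overline{D}_{\mathcal{E}}(-u) = -\overline{D}_{\mathcal{E}}(u)$ the maximal value
$\overline{D}_{\mathcal{E}}(u)$ is non-negative. Hence
$D_{\mathcal{E}}(u^+) = \log(1 + \exp(\overline{D}_{\mathcal{E}}(u))) \ge \log(2)$. □

It is straightforward to compute the first-order criticality conditions of
$\overline{D}_{\mathcal{E}}$: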

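Proposition 11. Let $\mathcal{E}$ be an exponential family with normal space
$\mathcal{N}$, let $u \in \partial U_{\mathcal{N}}$ be a local maximizer of
$\overline{D}_{\mathcal{E}}$, and let $\mathcal{Y} = \operatorname{supp}(u)$. The
following statements hold:

(i) $v(\mathcal{Y}) = 0$ for all $v \in \mathcal{N}$.

(ii) Let $P_{\mathcal{E}}$ be the rI-projection of $u^+$ and $u^-$, and let
$v \in \mathcal{N}$. Then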



J2 v{x) log < v+{Z')De{vo). (6) 



Proof See [I2] or [H Proposition 3.21]. □ 
3 Partition models



Partition exponential families are convex exponential families. The information diver- 
gence from convex exponential families has been studied in [11]. Apart from this, parti- 
tion exponential families do not seem to have been studied before, despite their peculiar 
properties. In other contexts the name "partition model" is used for other mathematical 
objects, but there seems to be little danger of confusion. 

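Definition 12. A partition $\mathcal{X}'$ of $\mathcal{X}$ is a family
$\mathcal{X}' = \{\mathcal{X}^1, \mathcal{X}^2, \dots, \mathcal{X}^{N'}\}$ of nonempty
subsets $\mathcal{X}^i \subseteq \mathcal{X}$ such that
$\mathcal{X} = \mathcal{X}^1 \cup \mathcal{X}^2 \cup \dots \cup \mathcal{X}^{N'}$ and
$\mathcal{X}^i \cap \mathcal{X}^j = \emptyset$ for all $1 \le i < j \le N'$. The subsets
$\mathcal{X}^i \subseteq \mathcal{X}$ are called the blocks of the partition
$\mathcal{X}'$. For any $x \in \mathcal{X}$ the block containing $x$ is denoted
$\mathcal{X}^x$.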

The coarseness $c(\mathcal{X}')$ of a partition $\mathcal{X}'$ is the cardinality of the
largest block of $\mathcal{X}'$. A partition $\mathcal{X}'$ is called homogeneous if all
blocks of $\mathcal{X}'$ have the same cardinality $c(\mathcal{X}')$. Partitions are in
bijection with equivalence relations, the blocks of a partition corresponding to the
equivalence classes. The equivalence relation induced by the partition $\mathcal{X}'$
is denoted $\sim_{\mathcal{X}'}$. In other words $x, y \in \mathcal{X}$ satisfy
$x \sim_{\mathcal{X}'} y$ if and only if $x$ and $y$ lie in the same block of
$\mathcal{X}'$.

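Definition 13. Let $\mathcal{X}'$ be a partition of $\mathcal{X}$. Denote by
$\mathbb{R}^{\mathcal{X}'}$ the set of functions
$\vartheta : \mathcal{X} \to \mathbb{R}$ such that $x \sim_{\mathcal{X}'} y$ implies
$\vartheta(x) = \vartheta(y)$. The exponential family $\mathcal{E}_{\mathcal{X}'}$ with
uniform reference measure and extended tangent space $\mathbb{R}^{\mathcal{X}'}$ is
called the partition exponential family of $\mathcal{X}'$, and
$\overline{\mathcal{E}}_{\mathcal{X}'}$ is the partition model of $\mathcal{X}'$.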

Partition models are, in fact, also linear families:
$\overline{\mathcal{E}}_{\mathcal{X}'}$ equals the intersection of
$\mathcal{P}(\mathcal{X})$ with the linear space $\mathbb{R}^{\mathcal{X}'}$. In
particular, partition exponential families are convex exponential families. Convex
exponential families have been studied by Ay and Matúš in [11], which contains more
detailed arguments for the following calculations. It follows from [11, Proposition 1]
that a convex exponential family is a partition exponential family if and only if it
contains the uniform distribution.

Remark 14. Partition models can be used to model symmetries. This was first noted by
Juríček, who used this idea to compute the global maximizers of $D_{\mathcal{E}}$ for
the multinomial models [H]. If a symmetry group $G$ acts on $\mathcal{X}$, then it
induces a partition $\mathcal{X}^G$ of $\mathcal{X}$ into orbits
$\mathcal{X}^1, \dots, \mathcal{X}^{N'}$. The action of $G$ extends naturally to an
action on $\mathbb{R}^{\mathcal{X}}$. Any exponential family that consists of
$G$-invariant probability measures is a subfamily of $\mathcal{E}_{\mathcal{X}^G}$ (such
exponential families are called $G$-exchangeable in [9]). Conversely, an arbitrary
partition model $\overline{\mathcal{E}}_{\mathcal{X}'}$ arises in this way from the
group of all permutations $g$ of $\mathcal{X}$ such that
$g(\mathcal{X}^i) = \mathcal{X}^i$ for all $\mathcal{X}^i \in \mathcal{X}'$.

Lemma 15. An exponential family with uniform reference measure and sufficient
statistics $A \in \mathbb{R}^{h \times \mathcal{X}}$ is a partition exponential family if
and only if its convex support is a simplex with vertex set
$\{A_x : x \in \mathcal{X}\}$.

Proof. A sufficient statistics of $\mathcal{E}_{\mathcal{X}'}$ is given by the
characteristic functions $a_i = 1_{\mathcal{X}^i}$ of the blocks of $\mathcal{X}'$. Any
column of $A = (a_i(x))_{i,x}$ is a unit vector, and therefore the convex support is a
simplex.
In the other direction define an equivalence relation $\sim$ on $\mathcal{X}$ via
$x \sim y$ if and only if $A_x = A_y$. Then $\mathcal{E}$ agrees with the partition
exponential family of this equivalence relation. □



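For partition models the mapping $P \mapsto P_{\mathcal{E}}$ is easy to compute: The
equation $AP = AP_{\mathcal{E}}$ translates into
$P(\mathcal{X}^i) = P_{\mathcal{E}}(\mathcal{X}^i)$ for $i = 1, \dots, N'$. Therefore,

$$P_{\mathcal{E}}(x) = P_{\mathcal{E}}^{\mathcal{X}^x}(x) \, P(\mathcal{X}^x), \quad \text{for all } x \in \mathcal{X}, \tag{7}$$

where $P_{\mathcal{E}}^{\mathcal{X}^x}$ denotes the truncation of $P_{\mathcal{E}}$ to
$\mathcal{X}^x$. Since $P_{\mathcal{E}}$ maximizes the entropy subject to (7), it
follows that $P_{\mathcal{E}}^{\mathcal{X}^x} = \frac{1}{|\mathcal{X}^x|} 1_{\mathcal{X}^x}$
is the uniform distribution on $\mathcal{X}^x$. Hence the rI-projection map
$P \mapsto P_{\mathcal{E}}$ averages over the blocks of the partition. It follows that

$$D_{\mathcal{E}}(P) = \sum_{i=1}^{N'} P(\mathcal{X}^i) \bigl( \log|\mathcal{X}^i| - H(P^{\mathcal{X}^i}) \bigr) \le \sum_{i=1}^{N'} P(\mathcal{X}^i) \log|\mathcal{X}^i|.$$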

As a consequence: 

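Lemma 16. If $\mathcal{E}$ is a partition model of a partition
$\mathcal{X}^1, \dots, \mathcal{X}^{N'}$ of coarseness $c$, then
$\max D_{\mathcal{E}} = \log(c)$. A probability measure
$P \in \mathcal{P}(\mathcal{X})$ maximizes $D_{\mathcal{E}}$ if and only if the
following two conditions are satisfied:

(i) $P(\mathcal{X}^i) > 0$ only if $|\mathcal{X}^i| = c$.

(ii) $P^{\mathcal{X}^i}$ is a point measure for all $i$ such that
$|\mathcal{X}^i| = c$ and $P(\mathcal{X}^i) > 0$.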
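
The block-averaging description of the rI-projection makes Lemma 16 easy to check on
examples. The following Python sketch is an illustration only; it assumes that the
projection onto a partition model spreads the mass of each block uniformly over the
block, as described above, and the function names and the example partition are
hypothetical.

```python
import math

def project_to_partition_model(P, blocks):
    """rI-projection onto a partition model: average P over each block."""
    Q = [0.0] * len(P)
    for block in blocks:
        mass = sum(P[x] for x in block)
        for x in block:
            Q[x] = mass / len(block)
    return Q

def divergence_from_partition_model(P, blocks):
    """D(P || P_E), where P_E is the block average of P."""
    Q = project_to_partition_model(P, blocks)
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

if __name__ == "__main__":
    blocks = [[0, 1, 2], [3, 4]]         # coarseness c = 3
    point_mass = [1.0, 0.0, 0.0, 0.0, 0.0]
    print(divergence_from_partition_model(point_mass, blocks), math.log(3))
    mixed = [0.5, 0.0, 0.0, 0.5, 0.0]    # puts mass on a block of size 2
    print(divergence_from_partition_model(mixed, blocks))
```

A point measure inside a largest block attains $\log(c)$, while a measure that charges
a smaller block stays strictly below this value, in agreement with conditions (i) and
(ii).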

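Corollary 17. Let $\mathcal{E}$ be the partition model of a partition $\mathcal{X}'$ of
coarseness $c$, and let $\mathcal{Z}$ be the union of the blocks of $\mathcal{X}'$ of
cardinality $c$. Then any $Q \in \mathcal{E}$ with support contained in $\mathcal{Z}$
is the rI-projection of some global maximizer of $D_{\mathcal{E}}$. In particular, if
$\mathcal{X}'$ is homogeneous, then any $Q \in \mathcal{E}$ is the rI-projection of
some global maximizer of $D_{\mathcal{E}}$.

Proof. For any $\mathcal{X}^i \in \mathcal{X}'$ of cardinality $c$ choose a
representative $x_i \in \mathcal{X}^i$. Define $P \in \mathcal{P}(\mathcal{X})$ by
$P(\mathcal{X}^i) = Q(\mathcal{X}^i)$ and $P^{\mathcal{X}^i} = \delta_{x_i}$ for all
$i$ such that $|\mathcal{X}^i| = c$. Then $P_{\mathcal{E}} = Q$, so the statement
follows from Lemma 16. □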

Remark 18. Composite systems have natural homogeneous partitions, which lead to
hierarchical models as defined in Section 2.3. Suppose that
$\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n$ and let
$K \subseteq \{1, \dots, n\}$. Then $K$ induces an equivalence relation $\sim_K$ on
$\mathcal{X}$ via $x \sim_K y$ if and only if $x_i = y_i$ for all $i \in K$. The
equivalence classes of $\sim_K$ form a homogeneous partition $\mathcal{X}^K$ of
$\mathcal{X}$ of coarseness $\prod_{i \notin K} N_i$. The corresponding partition model
$\mathcal{E}_K$ consists of those probability distributions $P$ satisfying
$P(x) = P(y)$ whenever $x \sim_K y$. Therefore, $\mathcal{E}_K$ equals the hierarchical
exponential family $\mathcal{E}_{\{K\}}$. Conversely, any homogeneous partition
$\mathcal{X}'$ can be used to find a bijection of $\mathcal{X}$ with a composite system
$\mathcal{X}_1 \times \mathcal{X}_2$, where $\mathcal{X}_1 = \mathcal{X}'$ and
$\mathcal{X}_2 \in \mathcal{X}'$. Then the partition $\mathcal{X}'$ arises from
$\sim_K$, where $K = \{1\}$.

4 Exponential families with $\max D_{\mathcal{E}} = \log(2)$

By Corollary 10 the maximal value of $D_{\mathcal{E}}$ is at least $\log(2)$ unless
$\overline{\mathcal{E}} = \mathcal{P}(\mathcal{X})$. This section studies exponential
families $\mathcal{E}$ where $\max D_{\mathcal{E}} = \log(2)$. For such an exponential
family, any kernel distribution is a local maximizer of $D_{\mathcal{E}}$. Furthermore,
$\overline{D}_{\mathcal{E}}(u) = 0$ for all $u \in \mathcal{N}$ (even if
$u \notin \partial U_{\mathcal{N}}$). The main results are:

Theorem 19. Let $\mathcal{E}$ be an exponential family on a finite set $\mathcal{X}$ of
cardinality $N$. If $\max D_{\mathcal{E}} = \log(2)$, then the dimension of
$\mathcal{E}$ is at least $\lceil N/2 \rceil - 1$.

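Theorem 20. Let $\mathcal{X}$ be a finite set of cardinality $N$, and let $\mathcal{E}$
be an exponential family on $\mathcal{X}$ of dimension $\lceil N/2 \rceil - 1$
satisfying $\max D_{\mathcal{E}} = \log(2)$. If $N$ is even, then
$\overline{\mathcal{E}}$ is a partition model. If $N$ is odd, then there is a set
$\mathcal{Z} \subseteq \mathcal{X}$ of cardinality three, a partition model
$\mathcal{E}_{\mathcal{X} \setminus \mathcal{Z}}$ on
$\mathcal{X} \setminus \mathcal{Z}$ and a one-dimensional exponential family
$\mathcal{E}_{\mathcal{Z}}$ on $\mathcal{Z}$ such that
$\max D(\cdot\|\mathcal{E}_{\mathcal{X} \setminus \mathcal{Z}}) = \log(2) = \max D(\cdot\|\mathcal{E}_{\mathcal{Z}})$,
and the closure $\overline{\mathcal{E}}$ equals the mixture of
$\mathcal{E}_{\mathcal{X} \setminus \mathcal{Z}}$ and
$\overline{\mathcal{E}}_{\mathcal{Z}}$. If $\mathcal{E}$ contains the uniform
distribution, then $\overline{\mathcal{E}}$ is a partition model.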

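Proposition 21. Let $\mathcal{X} = \{1, 2, 3\}$. For any
$u \in \mathbb{R}^{\mathcal{X}}$ such that $u_1 + u_2 + u_3 = 0$ there exists a unique
exponential family $\mathcal{E}$ on $\mathcal{X}$ with normal space
$\mathcal{N} = \mathbb{R}u$ such that $\max D_{\mathcal{E}} = \log(2)$.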

The proofs of the three results will be given below after a series of preliminary
lemmas. Under the additional assumption that $N$ is even, Theorem 20 has a simpler
proof, see Theorem 28.

Let $\mathcal{E}$ be an exponential family with sufficient statistics $A$ and normal
space $\mathcal{N}$.

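Lemma 22. For any $v_0, v_1, \dots, v_s \in \mathcal{N}$ let
$\mathcal{Z} = \operatorname{supp}(v_0) \setminus \bigcup_{j=1}^{s} \operatorname{supp}(v_j)$.
Suppose that $\max D_{\mathcal{E}} = \log(2)$. Then

$$\sum_{x \in \mathcal{Z}} v(x) \log\frac{|v(x)|}{\nu_x} = 0 \quad \text{and} \quad \sum_{x \in \mathcal{Z}} v(x) = 0 \quad \text{for all } v \in \mathcal{N}.$$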

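Proof. The proof is by induction on $s$. Let $s = 0$. Any $v_0 \in \mathcal{N}$
satisfies $\overline{D}_{\mathcal{E}}(v_0) = 0$ and is a local maximizer of
$\overline{D}_{\mathcal{E}}$. The equality $v(\mathcal{Z}) = 0$ for all
$v \in \mathcal{N}$ follows from Proposition 11 (i). Let
$\mathcal{Z}' = \mathcal{X} \setminus \mathcal{Z}$. Proposition 11 (ii) implies that

$$\sum_{x \in \mathcal{Z}'} v(x) \log\frac{|v(x)|}{\nu_x} \le v^+(\mathcal{Z}') \, \overline{D}_{\mathcal{E}}(v_0) = 0$$

for all $v \in \mathcal{N}$. Together with the same inequality with $v$ replaced by
$-v$ it follows that
$\sum_{x \in \mathcal{Z}'} v(x) \log\frac{|v(x)|}{\nu_x} = 0$. Hence
$\sum_{x \in \mathcal{Z}} v(x) \log\frac{|v(x)|}{\nu_x} = \overline{D}_{\mathcal{E}}(v) - \sum_{x \in \mathcal{Z}'} v(x) \log\frac{|v(x)|}{\nu_x} = 0$.
If $s \ge 1$, then let $\mathcal{Y} = \mathcal{X} \setminus \operatorname{supp}(v_s)$.
Let $\mathcal{E}'$ be the exponential family on $\mathcal{Y}$ with reference measure the
restriction $\nu|_{\mathcal{Y}}$ of $\nu$ to $\mathcal{Y}$ and normal space
$\mathcal{N}' = \{v|_{\mathcal{Y}} : v \in \mathcal{N}\}$. The case $s = 0$ implies
$\overline{D}_{\mathcal{E}'}(w) = \overline{D}_{\mathcal{E}}(v) - \sum_{x \in \operatorname{supp}(v_s)} v(x) \log\frac{|v(x)|}{\nu_x} = 0$
for all $w = v|_{\mathcal{Y}} \in \mathcal{N}'$. Therefore, the statement follows from
induction. □
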
Let $\underline{\mathcal{X}} = \{x \in \mathcal{X} : v(x) \ne 0 \text{ for some } v \in \mathcal{N}\}$.
Define a relation $\sim$ on $\underline{\mathcal{X}}$ via

$$x \sim y \iff v(y) \ne 0 \text{ for all } v \in \mathcal{N} \text{ such that } v(x) \ne 0.$$

It is easy to see that $\sim$ is an equivalence relation: If there exist
$v, w \in \mathcal{N}$ such that $v(y) \ne 0 = v(x)$ and $w(x) \ne 0 \ne w(y)$, then
$u := v(y) w - w(y) v \in \mathcal{N}$ satisfies $u(y) = 0 \ne u(x)$, and so $\sim$ is
symmetric. Transitivity can be shown similarly. In the language of matroid theory the
equivalence classes are the coparallel classes.

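Lemma 23. A subset $\mathcal{Z} \subseteq \underline{\mathcal{X}}$ is an equivalence
class of $\sim$ if and only if there exist circuits
$\sigma_0, \sigma_1, \dots, \sigma_s$ of $\mathcal{N}$ such that

$$\mathcal{Z} = \sigma_0 \setminus \bigcup_{j=1}^{s} \sigma_j,$$

and such that $\mathcal{Z} \cap \sigma \in \{\emptyset, \mathcal{Z}\}$ for all circuits
$\sigma$ of $\mathcal{N}$.

Proof. If $x \not\sim y$ for some $y \in \mathcal{X}$, then there exists a
$v \in \mathcal{N}$ such that $v(x) \ne 0$ and $v(y) = 0$. By Lemma 3 there exists a
circuit with the same property. Conversely, if $y \sim x$, then $y \in \sigma$ for any
circuit $\sigma$ such that $x \in \sigma$. □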

Let $C \in \mathbb{R}^{c \times \mathcal{X}}$ be a matrix such that the rows
$c_1, \dots, c_c$ of $C$ form a circuit basis of $\mathcal{N}$. Since each circuit basis
contains a basis, the rank of $C$ equals the dimension of $\mathcal{N}$. The columns of
$C$ are denoted by $\{C_x\}_{x \in \mathcal{X}}$.

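Lemma 24. Let $\mathcal{Z}$ be an equivalence class of $\sim$. The rank of the
submatrix $C|_{\mathcal{Z}}$ consisting of those columns indexed by $\mathcal{Z}$ is
one.

Proof. Let $\mathcal{Z} \subseteq \underline{\mathcal{X}}$. If the rank of
$C|_{\mathcal{Z}}$ is larger than one, then there exist two circuit vectors $c_1, c_2$
such that $c_1|_{\mathcal{Z}}$ and $c_2|_{\mathcal{Z}}$ are linearly independent and
have support $\mathcal{Z}$. Let $x \in \mathcal{Z}$. Let
$v = c_2(x) c_1 - c_1(x) c_2 \in \mathcal{N}$. Then $v|_{\mathcal{Z}} \ne 0$ and
$\operatorname{supp}(v|_{\mathcal{Z}}) \subseteq \mathcal{Z} \setminus \{x\}$.
Therefore, $\mathcal{Z}$ is not an equivalence class of $\sim$. □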

The main argument of the last proof can be reformulated in terms of the elimination
axiom of oriented matroid theory, cf. [13]. In the language of matroid theory Lemma 24
states that the coparallel classes of a matroid have corank one.

Proof of Theorem 19. Suppose $\max D_{\mathcal{E}} = \log(2)$. By Lemma 24 the rank of
$C$ is bounded from above by the number of equivalence classes of $\sim$. Let
$\mathcal{Z}$ be an equivalence class of $\sim$. By definition, the submatrix
$C|_{\mathcal{Z}} \in \mathbb{R}^{c \times \mathcal{Z}}$ is not the zero matrix. By
Lemmas 22 and 23 the rows $c_i|_{\mathcal{Z}}$ of $C|_{\mathcal{Z}}$ satisfy
$\sum_{x \in \mathcal{Z}} c_i(x) = 0$. Hence each equivalence class must contain at
least two elements. Therefore, the rank of $C$, which equals the codimension of
$\mathcal{E}$, is bounded from above by $\lfloor N/2 \rfloor$, and so the dimension of
$\mathcal{E}$ is bounded from below by
$N - 1 - \lfloor N/2 \rfloor = \lceil N/2 \rceil - 1$. □

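Lemma 25. If the dimension of $\mathcal{N}$ equals the number of equivalence classes of
$\sim$, then the equivalence classes are the circuits of $\mathcal{N}$. In other words,
the circuit vectors $c_1, \dots, c_c$ of a circuit basis are in bijection with the
equivalence classes $\mathcal{Z}_1, \dots, \mathcal{Z}_c$, such that
$\mathcal{Z}_i = \operatorname{supp}(c_i)$. Hence $\overline{\mathcal{E}}$ is the
mixture of $\overline{\mathcal{E}}_1, \dots, \overline{\mathcal{E}}_c$, where
$\mathcal{E}_i$ is the exponential family
$\overline{\mathcal{E}} \cap \mathcal{P}(\mathcal{Z}_i)^\circ$.
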
Proof. Let $\mathcal{Z}_1, \dots, \mathcal{Z}_{c'}$ be the set of equivalence classes
of $\sim$. Reorder $\mathcal{X}$ such that the equivalence classes are given by
consecutive numbers. Let $\tilde{C}$ be the matrix obtained from $C$ by doing a Gauss
elimination through row operations. By assumption $\tilde{C}$ has
$\dim \mathcal{N} = c'$ nonzero rows. By Lemma 24 the $i$th row $\tilde{c}_i$ of
$\tilde{C}$ has support contained in
$\mathcal{Z}_i \cup \dots \cup \mathcal{Z}_{c'}$. In particular,
$\operatorname{supp}(\tilde{c}_{c'}) = \mathcal{Z}_{c'}$. Therefore, $\tilde{c}_{c'}$ is
a circuit vector. If $v \in \mathcal{N}$ has $v(x) \ne 0$ for some
$x \in \mathcal{Z}_{c'}$, then
$v' = v - \frac{v(x)}{\tilde{c}_{c'}(x)} \tilde{c}_{c'}$ satisfies
$\operatorname{supp}(v') = \operatorname{supp}(v) \setminus \mathcal{Z}_{c'}$. Hence no
other circuit intersects $\mathcal{Z}_{c'}$. By induction,
$\operatorname{supp}(\tilde{c}_i)$ equals an equivalence class of $\sim$ for each $i$.
The first statement follows from
$\operatorname{supp}(\tilde{c}_i) \ne \operatorname{supp}(\tilde{c}_j)$ for
$1 \le i < j \le c'$. The last statement is a consequence of Corollary 5. □

Proof of Theorem 20. Assume that the dimension of $\mathcal{E}$ equals
$\lceil N/2 \rceil - 1$. By the proof of Theorem 19 there must be
$m := \lfloor N/2 \rfloor$ equivalence classes of $\sim$. If $N$ is even, then each
equivalence class has cardinality two. If $N$ is odd, then there may be one equivalence
class $\mathcal{Z}$ of cardinality three. In this case, reorder $\mathcal{X}$ such that
$\mathcal{Z} = \{N-2, N-1, N\}$. By Lemma 25 there exists a circuit vector
$c \in \mathcal{N}$ such that $\operatorname{supp}(c) = \mathcal{Z}$. Assume without
loss of generality that $c_{N-2}$ and $c_{N-1}$ are positive and that
$c_N = -(c_{N-1} + c_{N-2}) = -1$. Then

$$\sum_{i=N-2}^{N} c_i \log|c_i| = -h(c_{N-1}, c_{N-2}) \ne 0,$$

where $h(p, q)$ is the entropy of a binary random variable with probabilities $p, q$.
Therefore, if $N$ is even or if $1$ is a reference measure of $\mathcal{E}$, then all
equivalence classes of $\sim$ have cardinality two.

By Lemma 25 there are exponential families $\mathcal{E}_1, \dots, \mathcal{E}_c$ such
that $\mathcal{E}_i \subseteq \mathcal{P}(\mathcal{Z}_i)^\circ$ for $i = 1, \dots, c$
and such that $\overline{\mathcal{E}}$ is the mixture of
$\overline{\mathcal{E}}_1, \dots, \overline{\mathcal{E}}_c$. For $i = 1, \dots, c$
there is a unique circuit vector with support $\mathcal{Z}_i$, hence
$\mathcal{E}_i \ne \mathcal{P}(\mathcal{Z}_i)^\circ$, so $\mathcal{E}_i$ has dimension
$|\mathcal{Z}_i| - 2$. If $|\mathcal{Z}_i| = 2$, then $\mathcal{E}_i$ consists of the
uniform distribution $\frac{1}{2} 1_{\mathcal{Z}_i}$ on $\mathcal{Z}_i$, so
$\mathcal{E}_i$ is a partition model, and also the mixture of $\mathcal{E}_i$ for those
$i$ satisfying $|\mathcal{Z}_i| = 2$ is a partition model. □

Proof of Proposition 21. Let $\mathcal{E}$ be a one-dimensional exponential family with
normal space $\mathcal{N} = \mathbb{R}u$. Without loss of generality assume that $u^+$
and $u^-$ are probability measures. By Theorem 9 the set of local maximizers of
$D_{\mathcal{E}}$ consists of $u^+$ and $u^-$, and both are projection points.
$\mathcal{E}$ satisfies $\max D_{\mathcal{E}} = \log(2)$ if and only if
$(u^+)_{\mathcal{E}} = (u^-)_{\mathcal{E}} = \frac{1}{2}(u^+ + u^-)$, which happens if
and only if $u^+ + u^-$ is a reference measure of $\mathcal{E}$, proving existence and
uniqueness of $\mathcal{E}$. □

5 Optimal exponential families

Corollary 10 says that $\max D_{\mathcal{E}} \ge \log(2)$ for all exponential families
$\mathcal{E} \ne \mathcal{P}(\mathcal{X})^\circ$. Therefore $D$-optimality is only
interesting for $D \ge \log(2)$. The case $D = \log(2)$ was studied in Section 4, where
it was shown that $D_{N,k} = \log(2)$ if and only if
$\lceil N/2 \rceil - 1 \le k < N - 1$. This condition is equivalent to
$\lceil N/(k+1) \rceil = 2$. Many $\log(2)$-dimension optimal exponential families are
partition exponential families.
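
The equivalence of the two arithmetic conditions can be confirmed by brute force. The
following Python sketch is an illustration only: it assumes that the dimension $k$
ranges over $0, \dots, N-1$ (dimension $N-1$ corresponds to the full open simplex, where
the divergence vanishes) and that the conjectured value is
$\log\lceil N/(k+1) \rceil$. The function names are hypothetical.

```python
import math

def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def conjectured_value(N, k):
    """Conjectured value log(ceil(N / (k + 1))) of D_{N,k}."""
    return math.log(ceil_div(N, k + 1))

def in_log2_range(N, k):
    """Dimension range ceil(N/2) - 1 <= k < N - 1."""
    return ceil_div(N, 2) - 1 <= k < N - 1

if __name__ == "__main__":
    for N in range(2, 9):
        ks = [k for k in range(N) if in_log2_range(N, k)]
        print(N, ks, [ceil_div(N, k + 1) for k in ks])
```

For every listed pair the ceiling equals two, and the conjectured value is never below
the lower bound $\log(N/(k+1))$.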

Example 26. Any zero-dimensional exponential family $\mathcal{E} = \{\nu\}$ is
dimension-optimal. The function $P \mapsto D(P\|\nu)$ is convex on the probability
simplex $\mathcal{P}(\mathcal{X})$ and attains its maximum at a vertex of
$\mathcal{P}(\mathcal{X})$, which corresponds to a point distribution. Therefore,

$$\max D_{\mathcal{E}} = \max\{-\log(\nu_x) : x \in \mathcal{X}\} \ge \log(N).$$

Hence $D_{N,0} = \log(N)$, and $\mathcal{E}$ is $D$-optimal if and only if
$\nu_x \ge e^{-D}$ for all $x \in \mathcal{X}$. Zero-dimensional exponential families
are the dimension $D$-optimal exponential families for $D \ge \log(N)$. In general,
they are not the only inclusion $D$-optimal exponential families, see Example 27.

Example 27. Let $\mathcal{X} = \{1, 2, 3\}$. Any zero-dimensional exponential family
$\mathcal{E} = \{\nu\}$ satisfies $\max D_{\mathcal{E}} \ge \log(3)$. Therefore, if
$\log(2) \le D < \log(3)$, then the dimension $D$-optimal exponential families are
one-dimensional. The normal space $\mathcal{N}$ of any one-dimensional exponential
family $\mathcal{E}$ is spanned by a single element $u$, which can be taken to be
normalized, such that $\partial U_{\mathcal{N}} = \{\pm u\}$. By Theorem 9 the set of
local maximizers of $D_{\mathcal{E}}$ equals $\{u^+, u^-\}$. Let
$P_{\mathcal{E}} = (u^+)_{\mathcal{E}} = (u^-)_{\mathcal{E}}$, then
$P_{\mathcal{E}} = \mu u^+ + (1 - \mu) u^-$ for some $0 < \mu < 1$. Hence
$D_{\mathcal{E}}(u^+) = -\log \mu$ and $D_{\mathcal{E}}(u^-) = -\log(1 - \mu)$. It
follows that $\mathcal{E}$ is dimension $D$-optimal if and only if
$e^{-D} \le \mu \le 1 - e^{-D}$. Alternatively, using Theorem 9, $\mathcal{E}$ is
dimension $D$-optimal if and only if
$-\log(e^D - 1) \le \overline{D}_{\mathcal{E}}(u) \le \log(e^D - 1)$.

If $D \ge \log(3)$, then the dimension $D$-optimal exponential families are
zero-dimensional, consisting of a single point $\{\nu\}$ such that
$\min\{\nu_1, \nu_2, \nu_3\} \ge e^{-D}$. There are also one-dimensional inclusion
$D$-optimal exponential families: Consider, for example, the exponential family
$\mathcal{E}$ with sufficient statistics $A = (0, 1, 2)$ and reference measure
$\nu = (1, 4, 1)$. The two local maximizers are $u^+ = \delta_2$ and
$u^- = \frac{1}{2}(\delta_1 + \delta_3)$. Their rI-projection is
$P_{\mathcal{E}} = \frac{1}{6}\nu$. Hence $D_{\mathcal{E}}(u^+) = \log\frac{3}{2}$ and
$D_{\mathcal{E}}(u^-) = \log 3$, and so $\max D_{\mathcal{E}} = \log 3$. The monomial
parametrization of $\mathcal{E}$ is

$$P_\xi = \frac{1}{Z_\xi} (1, 4\xi, \xi^2),$$

where $\xi \in \mathbb{R}_{>}$ and $Z_\xi = 1 + 4\xi + \xi^2$. Consequently,
$\mathcal{E}$ does not contain the uniform distribution. Therefore, any point
$P \in \mathcal{E}$ satisfies $\max D(\cdot\|P) > \max D_{\mathcal{E}}$.
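
The value $\max D_{\mathcal{E}} = \log 3$ for the family with $A = (0, 1, 2)$ and
$\nu = (1, 4, 1)$ can be cross-checked by a grid search. The following Python sketch is
an illustration under stated assumptions: it uses the monomial parametrization
$P_\xi \propto (1, 4\xi, \xi^2)$, it computes the rI-projection by matching the
expectation of $A$ with a bisection on $\log \xi$ (valid here because this expectation
is increasing in $\xi$), and the grid resolution and all function names are
hypothetical choices.

```python
import math

NU = (1.0, 4.0, 1.0)

def p_xi(xi):
    """Monomial parametrization: P_xi proportional to (1, 4 xi, xi^2)."""
    w = (NU[0], NU[1] * xi, NU[2] * xi * xi)
    z = sum(w)
    return tuple(x / z for x in w)

def mean(P):
    """Expectation of the sufficient statistics A = (0, 1, 2)."""
    return P[1] + 2.0 * P[2]

def project(P, iterations=200):
    """Approximate rI-projection: the P_xi with the same expectation of A."""
    target = mean(P)
    lo, hi = -50.0, 50.0              # search interval for log(xi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean(p_xi(math.exp(mid))) < target:
            lo = mid
        else:
            hi = mid
    return p_xi(math.exp(0.5 * (lo + hi)))

def divergence(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def max_divergence(steps=40):
    """Maximize D(P || P_E) over a regular grid on the probability simplex."""
    best, argmax = -1.0, None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            P = (i / steps, j / steps, (steps - i - j) / steps)
            d = divergence(P, project(P))
            if d > best:
                best, argmax = d, P
    return best, argmax

if __name__ == "__main__":
    print(max_divergence(), math.log(3))
```

On this grid the maximum is found at $\frac{1}{2}(\delta_1 + \delta_3)$, matching the
local maximizer $u^-$ named in the example.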

The following theorem generalizes the special case of Theorem 20 when $N$ is even.

Theorem 28. Let $\mathcal{X}$ be a finite set of cardinality $N$. Then
$D_{N,k} \ge \log(N/(k+1))$ for all $0 \le k < N$. If $\mathcal{E}$ is a
$k$-dimensional exponential family that satisfies
$\max D_{\mathcal{E}} = \log(N/(k+1))$, then $\overline{\mathcal{E}}$ is a partition
model of a homogeneous partition of coarseness $N/(k+1)$. In particular, if $N$ is
divisible by $(k+1)$, then $D_{N,k} = \log(N/(k+1))$, and the dimension
$D_{N,k}$-optimal models are partition models.

Proof. First assume that $\mathcal{E} \in \mathcal{H}_1$. Let $A$ be a sufficient statistics of $\mathcal{E}$. The moment map
$\pi_A$ maps the uniform distribution $Q$ to a point in the relative interior of $\mathbf{M}_A$. By
Carathéodory's theorem there are $k+1$ vertices $A_{x_0}, \dots, A_{x_k}$ of $\mathbf{M}_A$ and
$\lambda_0, \dots, \lambda_k \ge 0$ such that $\pi_A(Q) = \sum_{i=0}^{k} \lambda_i A_{x_i}$ and $\sum_{i=0}^{k} \lambda_i = 1$.
Let $P = \sum_{i=0}^{k} \lambda_i \delta_{x_i}$, then $Q = P_{\mathcal{E}}$. By
the Pythagorean theorem, $\max D_{\mathcal{E}} \ge D_{\mathcal{E}}(P) = H(Q) - H(P) \ge \log(N) - \log(k+1)$,
proving the first assertion.

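If equality holds, then $\lambda_0 = \dots = \lambda_k = \frac{1}{k+1}$. Let $x \in X \setminus \{x_0, \dots, x_k\}$. For
$i \in \{0, \dots, k\}$ let $C_i$ be the convex hull of $A_{x_0}, \dots, A_{x_{i-1}}, A_{x_{i+1}}, \dots, A_{x_k}$ and $A_x$. By
Carathéodory's theorem the sets $C_i$ cover the convex hull of $A_{x_0}, \dots, A_{x_k}$ and $A_x$.
In particular, $\pi_A(Q) \in C_j$ for some $j \in \{0, \dots, k\}$, so
$\pi_A(Q) = \sum_{i \ne j} \lambda'_i A_{x_i} + \lambda'_j A_x$.
By the same argument as above it follows that $\lambda'_0 = \dots = \lambda'_k = \frac{1}{k+1}$. Therefore,

$$A_x = (k+1)\,\pi_A(Q) - \sum_{i \ne j} A_{x_i} = A_{x_j}.$$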

Let $\sim$ be the equivalence relation on $X$ defined by $x \sim y$ if and only if $A_x = A_y$, and
let $X' = \{X^1, \dots, X^{N'}\}$ be the corresponding partition into equivalence classes. Then
$N' \le k+1$ by what has been shown so far. From $\dim(\mathcal{E}) = \dim(\mathbf{M}_A)$ one concludes
$N' = k+1$, and $\mathbf{M}_A$ is a simplex of dimension $k$. By Lemma [15], $\mathcal{E}$ equals the partition
model of $X'$. Lemma [16] implies that the coarseness of $X'$ equals $N/(k+1)$, which must be an
integer. Furthermore, $X'$ is homogeneous.

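It remains to prove $\max D_{\mathcal{E}} \ge \log(N/(k+1))$ in the case $\mathcal{E} \notin \mathcal{H}_1$. Let $P_{\mathcal{E}}$ be the
rI-projection of the uniform distribution, and let $\mathcal{N}_1$ be the set of probability distributions
that rI-project to $P_{\mathcal{E}}$. The function $D_{\mathcal{E}}$ is convex on $\mathcal{N}_1$, hence is maximal at the
vertices of $\mathcal{N}_1$. Let $P$ be a vertex of $\mathcal{N}_1$. Assume that $v \in \mathcal{N}$ satisfies
$\operatorname{supp}(v) \subseteq \operatorname{supp}(P)$.
Then there exists $\epsilon > 0$ such that $P \pm \epsilon v \in \mathcal{N}_1$ and
$P = \frac{1}{2}(P + \epsilon v) + \frac{1}{2}(P - \epsilon v)$. Hence
$v = 0$. Therefore, the set $\{A_x : P(x) > 0\}$ is linearly independent. In particular
$|\operatorname{supp}(P)| \le \dim(\mathcal{E}) + 1$.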

Denote by $\mathcal{E}_1$ the exponential family with uniform reference measure and with the
same normal space as $\mathcal{E}$. On $\mathcal{N}_1$ the difference

$$\delta(P) := D_{\mathcal{E}}(P) - D_{\mathcal{E}_1}(P) = -\sum_{x} P(x) \log P_{\mathcal{E}}(x) - \log N$$

is an affine function that is positive at the uniform distribution. Hence there is a vertex $P$
of $\mathcal{N}_1$ such that $\delta(P) > 0$, and so
$D_{\mathcal{E}}(P) > D_{\mathcal{E}_1}(P) = \log N - H(P) \ge \log(N/(k+1))$. □

The value of $D_{N,k}$ is unknown when $k+1$ does not divide $N$. The situation is known
for $N = 3$, see Example [27]. If $1 \le k < 3$, then $D_{N,k} = \log(2)$, and all dimension
$D_{N,k}$-optimal exponential families that contain the uniform distribution are partition models.
The following conjecture generalizes this example and Theorems [20] and [25].

Conjecture 29. $D_{N,k} = \log\lceil N/(k+1) \rceil$, and the dimension $D_{N,k}$-optimal exponential families
containing the uniform distribution are partition models.

The following weaker statement holds: 

Lemma 30. Let $X' = \{X^1, \dots, X^{N'}\}$ be a partition of coarseness $c < N$ such that $X^1$
has cardinality $l \le c$ and all other components $X^i$ for $i > 1$ have cardinality $c$. Then the
partition model $\mathcal{E}$ of $X'$ is $\log(c)$-inclusion optimal.

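Proof. The fact that $\max D_{\mathcal{E}} = \log(c)$ follows from Lemma [16]. It remains to prove the
optimality. Let $\mathcal{E}' \subseteq \mathcal{E}$ be an exponential family contained in $\mathcal{E}$. Let $Z$ be the union
of all blocks of $X'$ of cardinality $c$. Assume that there exists a probability measure
$Q \in \overline{\mathcal{E}} \setminus \overline{\mathcal{E}'}$ with support contained in $Z$. By Corollary [17] there exists
$P \in \mathcal{P}(Z)$ such that $Q = P_{\mathcal{E}}$ and $D(P\|Q) = \log(c)$. Let $Q' = P_{\mathcal{E}'} \in \overline{\mathcal{E}'}$. Then
$D(P\|Q') = D(P\|Q) + D(Q\|Q') > \log(c)$ by the Pythagorean identity. Otherwise, if
$\overline{\mathcal{E}} \cap \mathcal{P}(Z) = \overline{\mathcal{E}'} \cap \mathcal{P}(Z)$, then
$\dim(\mathcal{E}) = \dim(\overline{\mathcal{E}} \cap \mathcal{P}(Z)) + 1 = \dim(\overline{\mathcal{E}'} \cap \mathcal{P}(Z)) + 1 \le \dim(\mathcal{E}')$,
so $\mathcal{E} = \mathcal{E}'$. □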
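
The value $\max D_{\mathcal{E}} = \log(c)$ used at the start of this proof can be checked on a small instance. The sketch below is illustrative only. It assumes that the partition model consists of the distributions that are constant on each block, so that the rI-projection of $P$ spreads the mass of each block uniformly over that block. The function names and the example partition (blocks of sizes 1, 2, 2 on five points, with 0-based labels) are hypothetical choices. The code checks only the maximal divergence, not inclusion optimality.

```python
import math
import random

def kl(p, q):
    """Information divergence D(p||q), natural log, with 0*log(0) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def project_to_partition_model(p, blocks):
    """Assumed rI-projection: keep block masses, make each block uniform."""
    q = [0.0] * len(p)
    for block in blocks:
        mass = sum(p[x] for x in block)
        for x in block:
            q[x] = mass / len(block)
    return q

def divergence_from_model(p, blocks):
    return kl(p, project_to_partition_model(p, blocks))

def random_distribution(n):
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

# Partition as in Lemma 30 with l = 1 and c = 2 (hypothetical example).
blocks = [[0], [1, 2], [3, 4]]
n = 5

# Divergence of each point measure from the model.
point_measures = [[1.0 if y == x else 0.0 for y in range(n)] for x in range(n)]
max_at_vertices = max(divergence_from_model(p, blocks) for p in point_measures)

# Random distributions should never exceed the value at the vertices.
random.seed(0)
max_random = max(divergence_from_model(random_distribution(n), blocks)
                 for _ in range(1000))
```

Here `max_at_vertices` is attained at the point measures inside the blocks of cardinality $c = 2$, while the point measure on the singleton block has divergence zero.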

Theorem [28] can be applied to the hierarchical models $\mathcal{E}_K$ for $K \subseteq [n]$ introduced
in Remark [18]. By this theorem the hierarchical model $\mathcal{E}_K$ is dimension optimal with
$\max D(\cdot\|\mathcal{E}_K) = \sum_{i \in [n] \setminus K} \log(N_i)$. If $N_n = 2$, then the choice
$K = \{1, \dots, n-1\}$ yields
an exponential family of dimension less than $|X|/2$ such that $\max D(\cdot\|\mathcal{E}_K) = \log(2)$, and
Theorem [19] implies that $\mathcal{E}_K$ is dimension optimal. The following proposition says that
the exponential families $\mathcal{E}_K$ are the unique dimension $D$-optimal hierarchical models for
many values of $D$.

Proposition 31. Let $X = X_1 \times \dots \times X_n$, where $N_i = |X_i| < \infty$. For any $K \subseteq [n]$ let
$D_K = \sum_{i \notin K} \log(N_i)$. The hierarchical model $\mathcal{E}_K$ is dimension $D_K$-optimal.

Let $l$ be any divisor of $N := |X| = \prod_{i=1}^{n} N_i$. If $\mathcal{E}$ is any hierarchical model that is
dimension $\log(N/l)$-optimal, then there is a subset $K \subseteq [n]$ such that $\mathcal{E} = \mathcal{E}_K$.

The proposition implies that if $l$ is not of the form $\prod_{i \in K} N_i$ for some subset $K \subseteq [n]$,
then there exists no hierarchical model that is dimension $\log(N/l)$-optimal.

Proof. It only remains to prove the last statement. If $\mathcal{E}$ satisfies the assumptions, then $\mathcal{E}$
is a partition model by Theorem [23]. Therefore, it suffices to prove that any hierarchical
model that is also a partition model is of the form $\mathcal{E}_K$.

Let $\Delta$ be a simplicial complex on $[n]$ such that $\mathcal{E} = \mathcal{E}_\Delta$, and let $K = \bigcup_{J \in \Delta} J$. Then $\mathcal{E}$ is
a submodel of $\mathcal{E}_K$. Let $A$ be a sufficient statistics of $\mathcal{E}$. By Lemma [7] the convex supports
of $\mathcal{E}$ and $\mathcal{E}_K$ have the same number of vertices. By Lemma [15] both are simplices, hence
they have the same dimension, so $\mathcal{E} = \mathcal{E}_K$. □
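
The remark after Proposition 31 can be made concrete by enumerating, for given state space sizes, which divisors $l$ of $N$ have the form $\prod_{i \in K} N_i$. The following Python sketch is illustrative: the function names and the example sizes $(4, 3)$ are hypothetical, and indices are 0-based, whereas the text uses $[n] = \{1, \dots, n\}$.

```python
import math
from itertools import combinations

def achievable_divisors(sizes):
    """Map each product l = prod_{i in K} N_i to the subsets K that give it."""
    result = {}
    for r in range(len(sizes) + 1):
        for K in combinations(range(len(sizes)), r):
            l = math.prod(sizes[i] for i in K)   # empty product is 1
            result.setdefault(l, []).append(K)
    return result

def d_k(sizes, K):
    """D_K = sum of log(N_i) over the indices i not in K."""
    return sum(math.log(sizes[i]) for i in range(len(sizes)) if i not in K)

sizes = (4, 3)                      # N_1 = 4, N_2 = 3, so N = 12
N = math.prod(sizes)
by_divisor = achievable_divisors(sizes)

divisors = [l for l in range(1, N + 1) if N % l == 0]
missing = [l for l in divisors if l not in by_divisor]   # divisors with no K
```

For these sizes the divisors 2 and 6 of $N = 12$ are not products over any subset $K$, so by the remark above there is no hierarchical model that is dimension $\log(6)$-optimal or dimension $\log(2)$-optimal.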

6 Discussion 

Conjecture [29] would imply that the partition models of Lemma [30] are dimension optimal 
among all exponential families. If the conjecture were true, then it would suggest the 
following interpretation: In many cases the information divergence D{P\\Q) can be 
interpreted as the information which is lost when P is the true probability distribution, 
but computations are carried out with $Q$. For example, in the case of the independence
model $\mathcal{E}_1$ of two variables, $D_{\mathcal{E}_1}$ equals the mutual information and measures the amount
of information that one variable carries about the other variable. If a probability measure
is replaced by its rI-projection, then this information is lost.
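
The independence-model case can be illustrated with a short computation. The Python sketch below assumes the standard fact that the rI-projection of a joint distribution onto the independence model is the product of its marginals. The function name and the two example tables are hypothetical.

```python
import math

def mutual_information(joint):
    """D(P || P_1 x P_2) for a joint distribution given as a list of rows."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (row_marginal[i] * col_marginal[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

# Two perfectly correlated binary variables: one bit is lost by projecting.
correlated = [[0.5, 0.0],
              [0.0, 0.5]]
# Two independent binary variables: nothing is lost.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

mi_correlated = mutual_information(correlated)
mi_independent = mutual_information(independent)
```

With natural logarithms, `mi_correlated` equals $\log(2)$, that is, one bit, and `mi_independent` equals zero.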

For the exponential families $\mathcal{E}_K$ the loss equals $D_K = \sum_{i \notin K} \log(N_i)$, which is precisely
the maximal information that the random variables that are not in $K$ can carry. Assuming
that the conjecture is true, if the model is smaller than $\mathcal{E}_K$, then, in general, more
information can be lost. In this interpretation the fact that $\max D_{\mathcal{E}} \ge \log(2)$ unless
$\mathcal{E} = \mathcal{P}(X)^\circ$ means that for any exponential family $\mathcal{E} \ne \mathcal{P}(X)^\circ$ in general at least one bit
is necessary to compensate for the approximation of arbitrary probability measures.